Abstract



We describe an algorithm to quantify dependence in a multivariate data set. The algorithm is able to identify any linear and non- 
linear dependence in the data set by performing a hypothesis test for two variables being independent. As a result we obtain a 
reliable measure of dependence. 

In high energy physics understanding dependencies is especially important in multidimensional maximum likelihood analyses. 
We therefore describe the problem of a multidimensional maximum likelihood analysis applied to a multivariate data set with
variables that are dependent on each other. We review common procedures used in high energy physics, show that general
dependence is not the same as linear correlation, and discuss the limitations of these procedures in practical application.

Finally we present the tool CAT, which is able to perform all reviewed methods in a fully automatic mode and creates an analysis 
report document with numeric results and visual review. 
1. Introduction

This paper describes an algorithm for quantifying dependen- 
cies in a multivariate data set. Throughout this paper we will, 
in contrast to common jargon, strictly speak of correlation only 
in the context of linear correlation, whereas dependence is used 
for general, linear and also non-linear, correlation. Understand- 
ing dependencies is especially useful and necessary in mul- 
tidimensional likelihood analysis, a technique widely used in 
high energy physics (HEP). Such analysis entails constructing 
a probability density function (PDF) describing the multivariate
data set. In many analyses dependencies among different
variables are neglected in the PDF. It must then be shown
that neglecting the dependencies is a valid procedure,
e.g. because they are small.

In section 2 a brief introduction to the maximum likelihood
method is given to illustrate the problems that arise from a data
set with variables that are not independent. Sections 3 and 4
review existing methods and discuss their limitations. In
section 5 a new algorithm for quantifying dependence is
explained, and section 6 presents CAT, a fully automatic analysis
tool. Section 7 briefly outlines the possibilities that exist for
dealing with dependencies in the data set.



2. Maximum likelihood analysis 

Consider an unbinned extended maximum likelihood analysis
of a data set with events of different categories c (e.g. signal
and background). The log-likelihood function is expressed as:

$$ \ln \mathcal{L} = \sum_{j=1}^{N} \ln \left( \sum_{i=1}^{N_c} N_i \, P_i(\vec{x}_j) \right) - \sum_{i=1}^{N_c} N_i \tag{1} $$

where 

• $N$ is the total number of events in the data set,

• $N_c$ is the number of different categories in the data set,

• $N_i$ is the expected number of events for the $i$-th category,

• $P_i$ is the PDF for the $i$-th category,

• $\vec{x}_j$ is the n-dimensional vector of variable values for the $j$-th
event.

In the analysis the log-likelihood is maximized by varying
the yields $N_i$ to extract the most likely set. If $\vec{x}$ has more than
one dimension, one usually speaks of a multidimensional analysis.
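The log-likelihood of equation (1) can be sketched in a few lines. The one-dimensional toy model below (a Gaussian signal peak on a flat background over the interval [0, 1]) and all function names are assumptions made for illustration only; they are not taken from the analysis described here.

```python
import math

def extended_log_likelihood(yields, pdfs, events):
    """ln L = sum_j ln( sum_i N_i * P_i(x_j) ) - sum_i N_i."""
    total = 0.0
    for x in events:
        density = sum(n_i * pdf(x) for n_i, pdf in zip(yields, pdfs))
        total += math.log(density)
    return total - sum(yields)

# Hypothetical toy PDFs on [0, 1] (illustration only)
def signal_pdf(x):
    # Narrow Gaussian peak; its tails outside [0, 1] are negligible
    sigma = 0.05
    return math.exp(-0.5 * ((x - 0.5) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def background_pdf(x):
    return 1.0  # flat distribution on [0, 1]

events = [0.48, 0.50, 0.52, 0.10, 0.90, 0.51, 0.30]
for n_sig, n_bkg in [(1.0, 6.0), (4.0, 3.0), (6.0, 1.0)]:
    value = extended_log_likelihood([n_sig, n_bkg], [signal_pdf, background_pdf], events)
    print(f"N_sig={n_sig}, N_bkg={n_bkg}: ln L = {value:.3f}")
```

In a real fit the yields would be varied by a minimizer rather than scanned by hand; the scan only shows that the function responds to the yields.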

The crucial point of a maximum likelihood analysis is to
choose the model properly. Such a model is either provided
by theory or must be derived from simulated data and
sideband studies. The latter is common practice in HEP. In
the case of a multidimensional analysis the model must also
describe the dependencies among different variables correctly. If
no theoretical model exists, e.g. for combinatorial background
components, experimentalists usually start by describing the
n-dimensional PDF as a product of marginal distributions:
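With $P_{i,k}$ denoting the marginal distribution of the k-th variable in category i (a notation assumed here, since the text does not fix one), the product form referred to as equation (2) can be written as:

```latex
P_i(\vec{x}) = \prod_{k=1}^{n} P_{i,k}(x_k)
```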
This procedure is entirely valid if there are no dependencies between
different variables. Indeed, equation (2) is the definition of
independence among variables.

If such a model $P_i$ is to be used for the $i$-th category, the $N_i$
events must have no dependencies among the different variables.
It is the task of the experimentalist to prove that this
assumption is valid, and it is the aim of the following sections to
provide assistance.

3. Linear correlation coefficient 

A frequently used quantity to describe dependence between two
variables x and y is the linear correlation coefficient r. For a
given sample of N events, it can be computed from the data by



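$$ r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \, \sum_{i=1}^{N} (y_i - \bar{y})^2}} \tag{3} $$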



where $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$ and $\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i$ are the sample
means. The values of r lie within the interval [-1, 1], where
r = 1 (-1) corresponds to 100% linear (anti-)correlation and r = 0
corresponds to no linear correlation. Figure 1(a) shows an
example of two variables with no linear correlation and figure
1(b) shows an example of two variables with linear correlation.

In general, it is not possible to conclude from the absence of
linear correlation that two variables are independent. For example,
in the case of two variables that follow a circular distribution,
i.e. x = R cos φ and y = R sin φ, the linear correlation
coefficient is zero (see figure 1(c)).
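The following sketch evaluates the linear correlation coefficient for two toy samples: points on a unit circle, which are fully dependent, and a noisy straight line. The sample sizes, the seed and the noise level are arbitrary choices made for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Linear correlation coefficient of two equally long samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

rng = random.Random(1)

# Circular distribution: y is fixed by x up to a sign, yet r is close to zero
angles = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(20000)]
circle_x = [math.cos(a) for a in angles]
circle_y = [math.sin(a) for a in angles]

# Noisy straight line: r is clearly positive
line_x = [rng.gauss(0.0, 1.0) for _ in range(20000)]
line_y = [x + 0.5 * rng.gauss(0.0, 1.0) for x in line_x]

print(f"circle: r = {pearson_r(circle_x, circle_y):+.3f}")
print(f"line:   r = {pearson_r(line_x, line_y):+.3f}")
```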

In HEP practice one should keep this limitation in mind as 
e. g. angular distributions can show a very small correlation co- 
efficient to other variables but are not necessarily independent. 

4. Projections in subranges 

To address the problem of dependencies between variables, a
common method in HEP is to look at projections of one variable
in subranges of the other. In figure 2 three examples of
this method are shown, using the same data sets that were
introduced in figure 1. In the case of independent variables the three
projections follow the same distribution. However, in general
this method does not allow one to conclude independence. One has
to be aware of symmetry axes in the distribution. By choosing
two bins with y > 0 and y < 0 instead of three, figure 2(c) would
lead to two similar distributions. By using an adequate number
of bins this problem can be avoided in practical applications.
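The symmetry pitfall can be made concrete with a small sketch. It assumes a toy sample on the unit circle as a stand-in for the circular data set, and it summarizes each projection by the RMS of x, a simplification chosen here for brevity, whereas the method described above compares full distributions.

```python
import math
import random

def rms_of_x_in_subranges(xs, ys, edges):
    """RMS of x for the events whose y lies in each [edges[k], edges[k+1])."""
    result = []
    for low, high in zip(edges[:-1], edges[1:]):
        selected = [x for x, y in zip(xs, ys) if low <= y < high]
        result.append(math.sqrt(sum(x * x for x in selected) / len(selected)))
    return result

rng = random.Random(7)
angles = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(20000)]
xs = [math.cos(a) for a in angles]
ys = [math.sin(a) for a in angles]

# Two subranges split at the symmetry axis y = 0: the projections look alike
two_bins = rms_of_x_in_subranges(xs, ys, [-1.01, 0.0, 1.01])
# Three subranges: the central one differs clearly from the outer two
three_bins = rms_of_x_in_subranges(xs, ys, [-1.01, -1.0 / 3.0, 1.0 / 3.0, 1.01])

print("two subranges:  ", [round(v, 2) for v in two_bins])
print("three subranges:", [round(v, 2) for v in three_bins])
```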

Another problem in practice is that it can be hard to judge
whether two variables are independent or not. It may not be
obvious by eye whether distributions that look very similar are
compatible with each other within uncertainties. Statistical tests
might be necessary to estimate their compatibility. In the case of
more than two variables it is also difficult to compare dependencies
and, e.g., to rank them by importance. The latter might be necessary
to judge which dependencies should be described by a conditional PDF
to improve the model. As multidimensional analyses with
four, five or even more dimensions are becoming an important
method, a reliable automatic procedure is desired.



5. Hypothesis test for independence 

Whereas the linear correlation coefficient is a quantitative
measure of linear correlation, it cannot be used to identify
general dependence. On the other hand, projections in subranges
can identify dependence but are difficult to compare or quantify
without additional work.

5.1. Copulas 

Copulas were introduced in 1959 by Sklar to describe
how a joint distribution function couples to its margins. Sklar's
theorem states:

Let S be a joint distribution function with margins F and G.
Then there exists a copula C such that for all x, y in R,

$$ S(x, y) = C(F(x), G(y)). \tag{4} $$

If F and G are continuous, then C is unique; otherwise, C is
uniquely determined on Ran F × Ran G. Conversely, if C is a
copula and F and G are distribution functions, then the function
S defined by equation (4) is a joint distribution function with
margins F and G.

Sklar's theorem and more details on copulas can be found in
[1]. A special copula is the unit copula C(u, v) = u × v, which
connects the marginal distributions of independent variables, as
can be seen from equation (2).
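The step from independence to the unit copula can be written out explicitly. The short derivation below assumes continuous margins F and G, so that the copula is unique; the density c(u, v) is the mixed second derivative of C.

```latex
% Independence: the joint distribution factorises into its margins
S(x, y) = F(x)\,G(y)
% Compare with Sklar's theorem, S(x, y) = C(F(x), G(y)), and set u = F(x), v = G(y):
C(u, v) = u\,v
% The corresponding copula density is constant on the unit square:
c(u, v) = \frac{\partial^2 C(u, v)}{\partial u\,\partial v} = 1
```

A flat copula density is therefore equivalent to independence of x and y.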

5.2. Hypothesis test for independence 

We therefore present an algorithm that tests the hypothesis
that, in a given data set with N events, two variables
x and y are independent.

1. Determine the probability integral transforms u = F(x)
and v = G(y) of variables x and y. First sort the data in
x and y. The values u = I/N (v = J/N), where I (J) is
the index of x (y) in the sorted range, then lie
within the interval [0, 1]. This is sometimes referred to as
flattening the distribution.

2. Create an n × n histogram H(u, v) with bins of equal size
and fill it with all events. The number of bins n should
be chosen such that $N/n^2$ is large enough (> 25). H(u, v)
corresponds to the empirical copula density.

3. In each bin of H(u, v), if x and y are independent, we expect
$e = N/n^2$ entries, and the statistical uncertainty can be
approximated by $\sigma_e = \sqrt{N/n^2}$ if the binning was chosen
as suggested in step 2.

4. Compute $\chi^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{(h_{ij} - e)^2}{\sigma_e^2}$, where $h_{ij}$ is the
content of the (i, j)-th bin of H(u, v).

5. Under the hypothesis that x and y are independent, and H(u, v)
is therefore flat, this quantity follows a $\chi^2$ distribution with
$n^2 - (2n - 1)$ degrees of freedom, from which the probability of
the data being consistent with the hypothesis is obtained. By
construction the number of degrees of freedom is
reduced by (2n - 1) due to the flatness of the two marginal
distributions.
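The steps above can be sketched as follows. This is a minimal illustration and not the implementation used in CAT: the function names are invented, ties are broken by input order, 0-based indices are assumed for I and J, and the last step stops at the χ² value and the number of degrees of freedom without evaluating the probability.

```python
import math
import random

def ranks(values):
    """0-based index I of each value in the sorted sample (ties keep input order)."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result

def independence_chi2(xs, ys, n_bins):
    """Return (chi2, ndf) for the flatness test of the n x n histogram H(u, v)."""
    n_events = len(xs)
    hist = [[0] * n_bins for _ in range(n_bins)]
    for i_rank, j_rank in zip(ranks(xs), ranks(ys)):
        # u = I/N lies in [0, 1); the bin index floor(u * n) is computed in integers
        hist[i_rank * n_bins // n_events][j_rank * n_bins // n_events] += 1
    expected = n_events / n_bins ** 2          # e = N / n^2, with sigma_e^2 = e
    chi2 = sum((h - expected) ** 2 / expected for row in hist for h in row)
    ndf = n_bins ** 2 - (2 * n_bins - 1)
    return chi2, ndf

rng = random.Random(42)
n_events, n_bins = 10000, 10                   # N / n^2 = 100 > 25

# Independent toy variables with very different marginal shapes
indep_x = [rng.gauss(0.0, 1.0) for _ in range(n_events)]
indep_y = [rng.expovariate(1.0) for _ in range(n_events)]

# Circular toy distribution: no linear correlation, but strong dependence
angles = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_events)]
circ_x = [math.cos(a) for a in angles]
circ_y = [math.sin(a) for a in angles]

chi2_indep, ndf = independence_chi2(indep_x, indep_y, n_bins)
chi2_circ, _ = independence_chi2(circ_x, circ_y, n_bins)
print(f"independent: chi2/ndf = {chi2_indep / ndf:.2f}")
print(f"circular:    chi2/ndf = {chi2_circ / ndf:.2f}")
```

Working with integer ranks avoids floating-point edge cases at bin boundaries and keeps both marginal distributions exactly flat whenever N is a multiple of n.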
Figure 2: Normalized projections on variable x in three different subranges of variable y for the three data sets shown in figure 1.



In short, the algorithm performs a test of H(u, v) being
consistent with the constant density c(u, v) = 1 expected from
the unit copula. The algorithm is able to identify any linear
or non-linear dependence. The probability of the hypothesis
can easily be compared among different pairs of variables in a
multivariate data set with more than two variables. It can also
be expressed as a significance in units of standard deviations
for the hypothesis that x and y are independent; see the section
on significance tests in [2, chap. 36.2.2]. Examples of the
resulting deviations from a flat distribution for histogram H(u, v)
are shown in figure 3 for the data sets introduced in figure 1.
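The translation of a probability into standard deviations can be sketched with the Python standard library. The two-sided convention used below is an assumption: it is not stated explicitly in the text, but it reproduces the quoted correspondence between p = 0.1995 and approximately 1.3σ.

```python
from statistics import NormalDist

def p_value_to_sigma(p):
    """Two-sided convention: p = 2 * (1 - Phi(Z)), solved for Z.

    Evaluating the inverse CDF at p/2 (instead of 1 - p/2) keeps the
    result numerically stable for very small probabilities.
    """
    return -NormalDist().inv_cdf(0.5 * p)

for p in (0.1995, 2.7e-3, 1e-15):
    print(f"p = {p:g}  ->  {p_value_to_sigma(p):.2f} sigma")
```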

The algorithm is very robust and delivers reliable results no 
matter whether variable values are located on a small interval 
or reach over several orders of magnitude as it is based on rank 
statistics. 

Another feature of this algorithm is that its output
scales with the size of the data set. A dependence might be
negligible for low statistics but significant for higher statistics.
Imagine for example a chessboard-like distribution. Neither
the algorithm nor the maximum likelihood fit will be sensitive
to this dependence with low statistics, and a simple product of
marginal distributions will describe the data. With increasing
statistics this dependence will become more and more significant
as the size of the bins decreases. The fit model will also
have to be adjusted once the dependence reaches a certain level.
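The scaling with the size of the data set can be made explicit with a short estimate. The relative deviation δ_ij of the true bin probability from the flat expectation is a notation introduced here for illustration, and the estimate holds for a fixed number of bins n.

```latex
% Assumed notation: \delta_{ij} is the relative excess of the true
% probability of bin (i, j) over the flat expectation 1/n^2.
\langle h_{ij} \rangle = \frac{N}{n^2}\,(1 + \delta_{ij}), \qquad
e = \sigma_e^2 = \frac{N}{n^2}
% Inserting the expected bin contents into the chi^2 sum:
\chi^2 \approx \frac{N}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \delta_{ij}^2
       + (\text{statistical fluctuations of order } n^2)
```

For a fixed pattern of deviations the first term grows linearly with N, while the fluctuation term does not, so a dependence that is invisible in a small sample becomes significant in a larger one.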

5.3. Practical application in HEP 

In practical HEP application of a multidimensional maxi- 
mum likelihood analysis the output of the algorithm offers the 
experimentalist a reliable quantity for supporting the decision 
to choose a simple product approach in the construction of the 
PDF. 

To verify that the approach is reasonable, a simulated data
set with the same statistics as the real data set can be checked
for any significant (> 5σ) or, to be conservative, evident (> 3σ)
dependence. If available, e.g. for signal events, a larger
simulated data set with 10 times the statistics could be checked to
confirm that it has no significant dependencies. What can be done in
the case of dependencies is briefly discussed in section 7.
(a) Probability p = 0.1995 corresponds to approximately 1.3σ significance.
(b) Probability $p < 10^{-15}$ corresponds to more than 8σ significance.
(c) Probability $p < 10^{-15}$ corresponds to more than 8σ significance.

Figure 3: Deviation in units of $\sigma_e$ for the histogram H(u, v) from a flat distribution for the three data sets shown in figure 1. The axis labels correspond to the
untransformed (original) values of x and y, which allow for a simpler interpretation than the values in u and v. The resulting probabilities for the distribution being
consistent with a flat distribution, and their translation into units of standard deviations, are given below each panel.



It is, however, not recommended to check simulated data sets
with e.g. 100 or 1000 times the statistics of real data, as are
sometimes available for signal events. Dependencies that
become significant only with these statistics are negligible for
a maximum likelihood analysis on real data statistics. Furthermore,
at such high statistics it might be questionable whether the
simulation has the proper level of accuracy to describe dependencies
to that detail.

6. CAT - A correlation analysis tool 

A careful study of dependencies requires a non-negligible
amount of work. As we have shown, simple and fast methods
such as the linear correlation coefficient do not deliver a
reliable result. We therefore developed a fully automatic tool,
CAT, that performs an analysis for a given multivariate data
set. Including such a tool in the workflow of a multidimensional
maximum likelihood analysis could significantly shorten
the amount of time necessary to understand the data sample.
Currently the following methods, some of which have been
discussed in this paper, are included:

1. Linear correlation coefficient

2. Profile plot of variable x vs. variable y and vice versa 

3. Projections of variable x in subranges of variable y and 
vice versa 

4. Hypothesis test of variable x and y being independent 

For a given data set with n variables all methods are com- 
puted for all pairs of variables automatically. An analysis report 
file is created, which provides a nice visual review and numeric 
results. 
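The idea of evaluating a measure for every pair of variables can be sketched as below. This is not CAT's interface: the function names, the CSV layout (one header row, one numeric column per variable) and the toy data are assumptions made for illustration, and only the linear correlation coefficient is evaluated.

```python
import csv
import io
import itertools
import math

def pearson_r(xs, ys):
    """Linear correlation coefficient of two equally long samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

def read_columns(csv_text):
    """Parse CSV text with a header row into {variable name: list of floats}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            columns[name].append(float(row[name]))
    return columns

def pairwise_report(columns, measure):
    """Evaluate `measure` for each of the n(n-1)/2 unordered pairs of variables."""
    return {(a, b): measure(columns[a], columns[b])
            for a, b in itertools.combinations(columns, 2)}

# Hypothetical toy input with three variables
csv_text = "x,y,z\n1,2,5\n2,4,3\n3,6,4\n4,8,1\n"
report = pairwise_report(read_columns(csv_text), pearson_r)
for (a, b), value in report.items():
    print(f"{a} vs {b}: r = {value:+.3f}")
```

Any other pairwise measure with the same call signature could be passed in place of `pearson_r`.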

CAT can be downloaded from [3]. As input a comma-separated
value (CSV) file is used, as such a file can be produced
easily from any type of user data format. A script to transform
data from a flat ROOT [4] tuple to CSV is provided, as this is
expected to be the most common case for application in HEP.
In addition, a script to generate some example random data sets
with different dependencies is provided. CAT is licensed under
the GPLv3 [5].

7. How to deal with dependencies? 

Unfortunately, a product PDF is sometimes not a valid approach.
Assuming three variables x, y and z and a significant
dependence between x and y, there are different possibilities.
One simple possibility is of course to remove either x or y
from the maximum likelihood analysis and perform e.g. a
simple cut on it. A more complicated approach would be to
perform the maximum likelihood analysis in bins of either x
or y. The latter can also be a first step to understand the
dependence better and finally to describe the probability
density function as a conditional PDF, so that the model becomes
P(x, y, z) = P(x|y) × P(y) × P(z). Whichever method is chosen,
dealing with dependencies can be a more complicated problem
than identifying them. It is all the more important to be able to
show that neglecting dependencies is a valid approach.

8. Possible applications beyond maximum likelihood fits 

In this paper we compare the empirical copula density against
the expected density from the unit copula to search for dependence.
In principle, comparisons of the empirical copula density
can also be made against other copulas, which e.g. describe
the expected density from standard model physics. Such an
approach does not require any model assumptions about possible
physics beyond the standard model. Similar approaches have
e.g. been made with the Sleuth algorithm in [6].

Another possible application is the identification of good input
variables for multivariate methods. A high dependence between
the target variable and input variables is usually desired.
Depending on the problem, it may also be that variables that
have a large dependence on a certain variable should be excluded,
such that the multivariate method cannot influence this specific
variable. A multivariate method should for example not
produce a peak in the mass distribution of background events,
which can be avoided by removing variables that have a strong
dependence on the mass. A widely used multivariate data analysis
package is TMVA [7], which is included in ROOT. In addition,
the NeuroBayes [8] package, which was developed in HEP,
has also found wide application among different experiments.
A general review of multivariate methods and applications in
HEP can be found in [9].

9. Conclusion 

We have presented an algorithm that is able to quantify de- 
pendencies in multivariate data sets. The algorithm is able to 
deliver a reliable measure of dependence for supporting the 
product approach in multidimensional likelihood analyses. We 
have shown how to interpret its result in practice and we ex- 
pect it to be a very useful method as these days more and more 
complicated and multidimensional analyses are carried out in 
HEP. 

In addition a fully automatic tool, CAT, was presented that 
performs a comprehensive analysis for a given multivariate data 
set and creates an analysis report. 